AITopics | linearized network

Residual connections remain ubiquitous in modern neural network architectures nearly a decade after their introduction. Their widespread adoption is often credited to their dramatically improved trainability: residual networks train faster, more stably, and achieve higher accuracy than their feedforward counterparts. While numerous techniques, ranging from improved initialization to advanced learning rate schedules, have been proposed to close the performance gap between residual and feedforward networks, this gap has persisted. In this work, we propose an alternative explanation: residual networks do not merely reparameterize feedforward networks, but instead inhabit a different function space. We design a controlled post-training comparison to isolate generalization performance from trainability; we find that variable-depth architectures, similar to ResNets, consistently outperform fixed-depth networks, even when optimization is unlikely to make a difference. These results suggest that residual connections confer performance advantages beyond optimization, pointing instead to a deeper inductive bias aligned with the structure of natural data.

artificial intelligence, feedforward network, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2506.14386

Country:

North America > United States (1.00)
Europe (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks

Ye, Jiayuan, Zhu, Zhenyu, Liu, Fanghui, Shokri, Reza, Cevher, Volkan

arXiv.org Machine LearningOct-31-2023

We analytically investigate how over-parameterization of models in randomized machine learning algorithms impacts the information leakage about their training data. Specifically, we prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets, and explore its dependence on the initialization, width, and depth of fully connected neural networks. We find that this KL privacy bound is largely determined by the expected squared gradient norm relative to model parameters during training. Notably, for the special setting of linearized network, our analysis indicates that the squared gradient norm (and therefore the escalation of privacy loss) is tied directly to the per-layer variance of the initialization distribution. By using this analysis, we demonstrate that privacy bound improves with increasing depth under certain initializations (LeCun and Xavier), while degrades with increasing depth under other initializations (He and NTK). Our work reveals a complex interplay between privacy and depth that depends on the chosen initialization distribution. We further prove excess empirical risk bounds under a fixed KL privacy budget, and show that the interplay between privacy utility trade-off and depth is similarly affected by the initialization.

artificial intelligence, machine learning, privacy, (15 more...)

arXiv.org Machine Learning

2310.20579

Country:

Asia > Singapore (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think

Mehmeti-Göpel, Christian H. X. Ali, Disselhoff, Jan

arXiv.org Artificial IntelligenceJun-1-2023

We perform an empirical study of the behaviour of deep networks when fully linearizing some of its feature channels through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how much we can simplify the network function towards linearity before performance collapses. First, we observe a significant performance gap when reducing nonlinearity in the network function early on as opposed to late in training, in-line with recent observations on the time-evolution of the data-dependent NTK. Second, we find that after training, we are able to linearize a significant number of nonlinear units while maintaining a high performance, indicating that much of a network's expressivity remains unused but helps gradient descent in early stages of training. To characterize the depth of the resulting partially linearized network, we introduce a measure called average path length, representing the average number of active nonlinearities encountered along a path in the network graph. Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width, which in turn depend on task difficulty.

artificial intelligence, machine learning, nonlinear advantage, (20 more...)

arXiv.org Artificial Intelligence

2211.1718

Country: